Overview

  • Reference genomes and GRC.
  • Fasta and FastQ (Unaligned sequences).
  • SAM/BAM (Aligned sequences).
  • BED (Genomic Intervals).
  • GFF/GTF (Gene annotation).
  • Wiggle files, BEDgraphs and BigWigs (Genomic scores).
  • VCF and MAF (Genomic variations).

Reference Genomes


Are we there yet?

  • The human genome isn’t complete!
  • In fact, most model organisms’ reference genomes are regularly updated.
  • A reference genome consists of a mixture of known chromosomes and unplaced contigs, together called a Genome Reference Assembly.
  • Major revisions to assemblies result in a change of coordinates.
  • This requires conversion of coordinates between revisions.
  • The latest genome assembly for humans is GRCh38.
  • Patches add information to the assembly without disrupting the chromosome coordinates, e.g. GRCh38.p3.

Genome Reference Consortium.

  • The GRC is a collaboration of institutes which curate and maintain the reference genomes for 3 model organisms.
    • Human - GRCh38.p3
    • Mouse - GRCm38.p3
    • Zebrafish - GRCz10
  • Other model organisms are maintained separately.
    • Drosophila - Berkeley Drosophila Genome Project, BDGP36

Why do we need to know about reference genomes?

  • Allows for genes and genomic features to be evaluated in their linear genomic context.
    • Gene A is close to Gene B
    • Gene A and Gene B are within feature C.
  • Can be used to align shallow, targeted high-throughput sequencing to a pre-built map of an organism’s genome.

Aligning to a reference genome

A reference genome

  • A reference genome is a collection of contigs.
  • A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
  • Typically comes in FASTA format.
    • “>” line contains information on contig
    • Lines following contain contig sequence
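To make the two line types concrete, here is a minimal sketch in Python that reads FASTA text into a dictionary of contig name to sequence. The function name `parse_fasta` and the toy contigs are illustrative assumptions and do not come from any real assembly.

```python
def parse_fasta(lines):
    """Return {contig_name: sequence} from an iterable of FASTA lines."""
    contigs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first word after ">" as the contig name.
            name = line[1:].split()[0]
            contigs[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header.
            contigs[name].append(line.upper())
    return {n: "".join(parts) for n, parts in contigs.items()}


# Toy input (made-up sequences, not a real genome).
example = """>chr1 toy contig
ACGTNNAC
GGTA
>chrUn_001 unplaced toy contig
TTGACA
""".splitlines()

genome = parse_fasta(example)
for contig, sequence in genome.items():
    print(contig, len(sequence), sequence)
```

Real reference genomes are large, so in practice an indexed reader is normally preferred over loading every contig into memory.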

High-throughput Sequencing formats.


High-throughput Sequencing formats

  • Unaligned sequence files generated from HTS machines are mapped to a reference genome to produce aligned sequence files.
    • FASTQ - Unaligned sequences
    • SAM - Aligned sequences

Unaligned Sequences

FastQ (FASTA with Qualities)

  • “@” followed by identifier.
  • Sequence information.
  • “+”
  • Quality scores encoded as ASCII characters.
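The four-line layout above can be sketched as a small Python reader. This assumes each record occupies exactly four lines (sequence and qualities are not wrapped); the name `parse_fastq` and the toy reads are illustrative only.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Toy input (made-up reads).
example = [
    "@read1/1", "ACGTN", "+", "IIII#",
    "@read2/1", "GGCA", "+", "FFFF",
]

records = list(parse_fastq(example))
for identifier, seq, qual in records:
    print(identifier, seq, qual)
```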

Unaligned Sequences

FastQ - Header

  • Header for each read can contain additional information
    • HS2000-887_89 - Machine name.
    • 5 - Flowcell lane.
    • /1 - Read 1 or 2 of pair (here read 1)
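As a rough sketch, the header fields listed above can be pulled apart in Python. The exact header layout varies between instruments and software versions; the example header below is hypothetical and assumes colon-separated fields with the machine name first, the lane second, and a trailing `/1` or `/2` for the read of the pair.

```python
def parse_read_header(header):
    """Split an assumed 'machine:lane:...:.../read' FASTQ header into parts."""
    header = header.lstrip("@")
    # Separate the optional "/1" or "/2" pair suffix from the rest.
    name, _, mate = header.partition("/")
    fields = name.split(":")
    return {
        "machine": fields[0],
        "lane": int(fields[1]),
        "read": int(mate) if mate else None,
    }


# Hypothetical header; only the machine name, lane and /1 follow the slide.
info = parse_read_header("@HS2000-887_89:5:1101:1408:2208/1")
print(info)
```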

Unaligned Sequences

FastQ - Qualities

  • Qualities follow “+” line.
  • Phred score: Q = -10 × log10(probability of the base call being wrong).
  • Encoded as single ASCII characters to save space.
  • Used in quality assessment and downstream analysis
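A short Python sketch of the decoding step follows. It assumes the common Phred+33 encoding (ASCII code minus 33 gives the score); older Illumina pipelines used an offset of 64, so the `offset` parameter is left adjustable. The function names are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)


scores = decode_qualities("I5#")
print(scores)  # [40, 20, 2]
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance of error, and 40 to 1 in 10,000.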

Aligned sequences

SAM format

  • SAM - Sequence Alignment Map.
  • Standard format for aligned sequence data.
  • Recognised by majority of software and browsers.

Aligned sequences

SAM header

  • SAM header contains information on alignment and contigs used.
    • @HD - Version number and sorting information.
    • @SQ - Contig/Chromosome name and length of sequence.
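The two header record types can be read with a few lines of Python. This is a minimal sketch: it keeps only `@HD` and `@SQ` lines, and the contig lengths in the example are made up.

```python
def parse_sam_header(lines):
    """Collect @HD and @SQ records from the header lines of a SAM file."""
    header = {"HD": {}, "SQ": []}
    for line in lines:
        if not line.startswith("@"):
            break  # header ends where alignment records begin
        tag, *fields = line.rstrip("\n").split("\t")
        if tag not in ("@HD", "@SQ"):
            continue  # other header record types are ignored in this sketch
        # Each field is TAG:value, e.g. "SN:chr1".
        record = dict(field.split(":", 1) for field in fields)
        if tag == "@HD":
            header["HD"] = record
        else:
            header["SQ"].append(record)
    return header


# Toy header (contig lengths are made up).
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:800",
]

header = parse_sam_header(example)
print(header["HD"]["SO"])
print([(sq["SN"], int(sq["LN"])) for sq in header["SQ"]])
```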

Aligned sequences

SAM - Aligned reads

  • Contains read and alignment information and location
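Each alignment line has 11 mandatory tab-separated fields followed by optional tags. A minimal Python sketch of splitting one line into named fields is shown below; the record name `SamRecord` and the example read are illustrative assumptions.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields and tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM record needs at least 11 fields")
    return SamRecord(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
    )


# Toy record (made-up read).
record = parse_sam_line(
    "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0"
)
# Bit 0x10 of the flag marks a read aligned to the reverse strand.
is_reverse = bool(record.flag & 0x10)
print(record.rname, record.pos, record.cigar, is_reverse)
```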

Aligned sequences

SAM - Aligned reads

  • Read name.
  • Sequence of read.
  • Encoded sequence quality.

Aligned sequences

SAM - Aligned reads

  • Chromosome to which read aligns.
  • Leftmost position (1-based) in chromosome to which read aligns.
  • Alignment information - “Cigar string”.
    • 100M - 100 bases aligned continuously (M covers both matches and mismatches)
    • 28M1D72M - 28 aligned bases, 1 reference base deleted from the read, then 72 aligned bases
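The CIGAR examples above can be unpacked with a regular expression. The Python below is a sketch: it splits a CIGAR string into (length, operation) pairs and totals how many reference and read bases the alignment covers. The function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, operation) pairs from a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join("%d%s" % pair for pair in ops) != cigar:
        raise ValueError("invalid CIGAR string: %r" % cigar)
    return ops


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


for cigar in ("100M", "28M1D72M"):
    print(cigar, parse_cigar(cigar), reference_span(cigar), read_length(cigar))
```

For `28M1D72M` the read is 100 bases long but spans 101 bases of the reference, because the deleted base exists only in the reference.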

Aligned sequences

SAM - Aligned reads

Summarised Genomic Features formats.


Summarised Genomic Features formats

  • Post alignment, sequence reads are typically summarised into scores over/within genomic intervals.
    • BED - Genomic intervals and information.
    • Wiggle/BedGraph - Genomic intervals and scores.
    • GFF - Genomic annotation with information and scores

Summarising in genomic intervals.

**BED format (BED3)**

  • Simple format
  • 3 tab-separated columns
  • Chromosome, start (0-based), end (not included in the interval)

Summarising in genomic intervals.

**BED format (BED6)**

  • Chromosome, start, end
  • Identifier
  • Score
  • Strand (“.” for strandless)
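A minimal Python sketch of reading BED lines follows. It accepts both the 3-column and 6-column forms; the name `Bed6` and the toy intervals are illustrative. Because BED starts are 0-based and ends are exclusive, the interval length is simply end minus start.

```python
from collections import namedtuple

Bed6 = namedtuple("Bed6", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse a BED3 to BED6 line; missing optional columns become defaults."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 3:
        raise ValueError("BED needs at least chromosome, start and end")
    name = f[3] if len(f) > 3 else "."
    score = float(f[4]) if len(f) > 4 and f[4] != "." else None
    strand = f[5] if len(f) > 5 else "."
    return Bed6(f[0], int(f[1]), int(f[2]), name, score, strand)


# Toy intervals (made-up coordinates).
bed3 = parse_bed_line("chr1\t100\t250")
bed6 = parse_bed_line("chr1\t100\t250\tpeak1\t50\t+")
print(bed3)
print(bed6, "length:", bed6.end - bed6.start)
```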

Summarising in genomic intervals.

**narrowPeak and broadPeak**

  • narrowPeak and broadPeak are extensions to BED6 used in ENCODE’s peak calling.
  • Contain p-values and q-values.
  • narrowPeak - BED6+4
  • broadPeak - BED6+3
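As a sketch of the BED6+4 layout, the Python below names the four extra narrowPeak columns (signal value, p-value, q-value and summit offset). In this format the p-value and q-value columns are stored as -log10 values, and the last column is the summit position relative to the interval start. The example peak is made up.

```python
NARROWPEAK_COLUMNS = [
    "chrom", "start", "end", "name", "score", "strand",
    "signalValue", "pValue", "qValue", "peak",
]


def parse_narrowpeak_line(line):
    """Parse one narrowPeak (BED6+4) line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(NARROWPEAK_COLUMNS):
        raise ValueError("narrowPeak lines have 10 columns")
    record = dict(zip(NARROWPEAK_COLUMNS, values))
    for key in ("start", "end", "score", "peak"):
        record[key] = int(record[key])
    for key in ("signalValue", "pValue", "qValue"):
        record[key] = float(record[key])
    return record


# Toy peak (made-up values).
peak = parse_narrowpeak_line("chr1\t100\t300\tpeak1\t500\t.\t12.5\t8.2\t6.1\t75")
summit = peak["start"] + peak["peak"]
print(peak["name"], "summit at", summit, "-log10 q =", peak["qValue"])
```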

Signal at genomic positions

  • Common practice to review signal over genome.
  • Special formats exist for this
    • Wiggle
    • bedGraph

Signal at genomic positions

  • Wiggle information line
    • Chromosome
    • Step size
    • Step start position
  • Score
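The relationship between the information line and the scores can be sketched in Python. This example handles only the `fixedStep` flavour of Wiggle (not `variableStep`), and the toy track is made up. Wiggle positions are 1-based.

```python
def parse_fixed_step(lines):
    """Yield (chromosome, position, score) from fixedStep Wiggle lines."""
    chrom = position = step = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("fixedStep"):
            # Information line, e.g. "fixedStep chrom=chr1 start=101 step=10".
            params = dict(p.split("=", 1) for p in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])
            step = int(params.get("step", 1))
            continue
        if chrom is None:
            raise ValueError("score found before a fixedStep line")
        yield chrom, position, float(line)
        position += step  # each score line advances by the step size


# Toy track (made-up scores).
example = [
    'track type=wiggle_0 name="toy"',
    "fixedStep chrom=chr1 start=101 step=10",
    "1.5",
    "2.0",
    "0.5",
]

signal = list(parse_fixed_step(example))
print(signal)
```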

Signal at genomic positions

bedGraph

  • BED3 format
    • Chromosome
    • Start
    • End
  • 4th column - Score
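A bedGraph line is a BED3 interval plus a score, so it can be read much like BED. The Python sketch below also computes a length-weighted mean signal as one example of a summary; the function names and toy values are illustrative.

```python
def parse_bedgraph(lines):
    """Yield (chromosome, start, end, score) from bedGraph lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, score = line.split()[:4]
        yield chrom, int(start), int(end), float(score)


def mean_signal(intervals):
    """Mean score per base, weighting each interval by its length."""
    total = covered = 0
    for _, start, end, score in intervals:
        total += (end - start) * score
        covered += end - start
    return total / covered if covered else 0.0


# Toy track (made-up scores): 10 bases at 2.0 and 20 bases at 5.0.
example = ["chr1\t0\t10\t2.0", "chr1\t10\t30\t5.0"]
intervals = list(parse_bedgraph(example))
average = mean_signal(intervals)
print(intervals, average)
```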

Genomic Annotation.


Genomic Annotation

GFF

  • Used for genome annotation.
  • Stores position, feature (exon) and meta-feature (transcript/gene) information.

Genomic Annotation

  • Chromosome
  • Start of feature
  • End of Feature
  • Strand

Genomic Annotation

  • Source
  • Feature type
  • Score

Genomic Annotation

  • Column 9 contains key=value pairs (ID=exon01), separated by semi-colons “;”
  • ID - Feature name.
  • Parent - Meta-feature name.
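The nine GFF columns and the key=value attributes in column 9 can be parsed as in the Python sketch below. It assumes GFF3-style attributes; GTF files write column 9 differently (`key "value";`), so they would need a different split. The example feature and the `Parent` value are made up.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 9:
        raise ValueError("GFF lines have 9 tab-separated columns")
    attributes = {}
    for pair in f[8].split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attributes[key] = value
    return {
        "seqid": f[0], "source": f[1], "type": f[2],
        "start": int(f[3]), "end": int(f[4]),
        "score": f[5], "strand": f[6], "phase": f[7],
        "attributes": attributes,
    }


# Toy feature (made-up coordinates and names).
feature = parse_gff3_line(
    "chr1\texample\texon\t1300\t1500\t.\t+\t.\tID=exon01;Parent=mRNA01"
)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```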

Genomic Variants

  • Variant Call Format (VCF)
  • Mutation Annotation Format (MAF)

Variant Call Format

Variant Call Format (VCF) is a text file format, usually stored compressed. It contains:

  • meta-information lines
  • a header line
  • data lines each containing information about a position in the genome

The format also has the ability to contain genotype information on samples for each position.
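The three kinds of line, plus the optional per-sample genotype columns, can be separated with a short Python sketch. The function name `parse_vcf`, the sample name and the variant shown are illustrative assumptions.

```python
def parse_vcf(lines):
    """Split VCF text into meta-information, sample names and data records."""
    meta, samples, records = [], [], []
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])              # meta-information line
        elif line.startswith("#"):
            columns = line[1:].split("\t")     # header line
            samples = columns[9:]              # sample names follow FORMAT
        elif line:
            if columns is None:
                raise ValueError("data line found before the header line")
            values = line.split("\t")
            record = dict(zip(columns[:8], values[:8]))
            record["POS"] = int(record["POS"])
            if samples:
                # FORMAT (column 9) names the per-sample sub-fields.
                keys = values[8].split(":")
                record["genotypes"] = {
                    sample: dict(zip(keys, value.split(":")))
                    for sample, value in zip(samples, values[9:])
                }
            records.append(record)
    return meta, samples, records


# Toy file (made-up variant and sample).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT:DP\t0/1:20",
]

meta, samples, records = parse_vcf(example)
print(meta, samples)
print(records[0]["CHROM"], records[0]["POS"], records[0]["genotypes"])
```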

VCF Structure

Mutation Annotation Format (MAF)

Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF files.

MAF Structure

Genomic Files for computing.


bigWig, bigBed and tabix

  • Many programs and browsers deal better with compressed, indexed versions of genomic files
    • SAM -> BAM (.bam and index file of .bai)
    • Wiggle and bedGraph -> bigWig (.bw/.bigWig)
    • BED -> bigBed (.bb)
    • BED, VCF and GFF -> bgzip-compressed and tabix-indexed (.gz and index file of .tbi)
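Files compressed with bgzip are gzip-compatible, so their text can be read with Python’s standard `gzip` module. The sketch below writes a small compressed BED file to a temporary directory and reads it back. Note the assumptions: the standard library writes ordinary gzip rather than bgzip blocks and cannot build a `.tbi` index, so this only illustrates sequential reading; the file name is made up.

```python
import gzip
import os
import tempfile


def read_compressed_lines(path):
    """Yield text lines from a gzip (or bgzip) compressed file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bed.gz")
    # Write a tiny compressed BED file (made-up intervals).
    with gzip.open(path, "wt") as out:
        out.write("chr1\t100\t250\n")
        out.write("chr2\t5\t50\n")
    lines = list(read_compressed_lines(path))

print(lines)
```

Random access to a region, which is the point of the index files listed above, needs a dedicated tool or library rather than a sequential read like this one.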